
Cross-Lingual Transfer with Order Differences

This repo contains the code and models for the NAACL19 paper: "On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing" [arxiv] [paper] [bib]

This is built upon NeuroNLP2 and PyTorch 0.3.


Easy Running Steps

We provide some easy-to-run example scripts.

Easy preparing

  • (Note): The data-preparation script requires Python3, while the rest of the main running steps require Python2.
  • For easy preparation, simply run examples/run_more/go_data.sh. This is a one-step script that fetches and prepares all the data (it may require a lot of disk space, mostly for the embedding files).

Running Environment

  • This implementation should run with Python2 + PyTorch 0.3; we suggest using conda to install the required environment (a quick sanity check is sketched after the commands):
  • conda create -n myenv python=2.7; source activate myenv; conda install gensim;
  • conda install pytorch=0.3.1 cuda80 -c pytorch
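Once the environment is installed, the setup can be sanity-checked from inside Python (a hypothetical snippet, not part of the repo; the expected values follow the versions above):

    import sys
    import torch
    import gensim  # imported only to confirm the install

    print(sys.version)                 # expect 2.7.x
    print(torch.__version__)           # expect 0.3.1
    print(torch.cuda.is_available())   # True if the cuda80 build sees a GPU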

Easy running


Details

The rest of this document provides more details on each of the running steps.

Data Preparation

  • The data format is basically the CoNLL-U format (here) of UD v2.2, but with some crucial differences:
  • First, all comment lines (starting with #) and non-integer-ID lines (multiword or empty tokens) should be removed.
  • Moreover, POS tags are read from column 5 instead of column 4, so a simple column move is needed (see the sketch after this list).
  • Aligned cross-lingual embeddings are required as inputs. We use the old version of the fastText embeddings and fastText_multilingual for alignment.
  • Please refer to examples/run_more/prepare_data.py for the data-preparation step.
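For illustration, the format tweaks above boil down to a filter like the following (a minimal sketch only; use examples/run_more/prepare_data.py for the real preparation; assumes tab-separated UD input on stdin):

    # Minimal sketch of the format tweaks above; the real logic lives in
    # examples/run_more/prepare_data.py.
    import sys

    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:                    # keep blank lines (sentence separators)
            print(line)
            continue
        if line.startswith("#"):        # drop comment lines
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():       # drop multiword (1-2) and empty (1.1) tokens
            continue
        cols[4] = cols[3]               # copy POS into column 5, where the parser reads it
        print("\t".join(cols))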

Training and Testing

  • We provide example scripts for training and testing; please follow those examples (some of the paths in the scripts are specific to our environment, so you may need to set the correct paths).

  • Step 1: build the dictionaries (see examples/run_more/prepare_vocab.sh). This step builds the vocabularies (using examples/vocab/build_joint_vocab_embed.py) for the source language, together with the source embeddings.
  • Step 2: train the models (see examples/run_more/train_*.sh) on the source language. Here, we have four types of models corresponding to those in our paper; the names differ slightly, with the following mappings: SelfAttGraph -> train_graph.sh, RNNGraph -> train_graph_rnn.sh, SelfAttStack -> train_stptr.sh, RNNStack -> train_stptr_rnn.sh.
  • Extra: in these scripts, the file paths should be changed to the correct ones: --word_path for the embedding file, and --train, --dev, --test for the corresponding data files.
  • Step 3: test with the trained models (see examples/run_more/run_analyze.sh). Here too, the paths for the extra-language data (--test) and extra-language embeddings (--extra_embed) should be set accordingly.

  • Our trained models (English as source, 5 different random runs) can be found here.
  • Warning: the embeddings for zh and ja are not well aligned; our paper reports de-lexicalized results, which can be obtained by adding the flag --no_word for both training and testing.
  • Warning 2: the outputs do not keep the original ordering of the input file; they are sorted by sentence length. Both the system output and the gold parses are written out in this new ordering (*_pred, *_gold), so the two files can be compared line by line (see the sketch below).
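Because *_pred and *_gold share the same (length-sorted) ordering, a rough attachment score can be computed by a line-by-line comparison. The following is a hypothetical sketch, assuming standard CoNLL columns (head in column 7, relation in column 8) and no punctuation filtering, so numbers may differ slightly from the official evaluation:

    # Rough UAS/LAS from the re-ordered outputs; the file names are hypothetical.
    def score(pred_path, gold_path):
        total = uas = las = 0
        with open(pred_path) as fp, open(gold_path) as fg:
            for lp, lg in zip(fp, fg):      # same ordering in both files
                lp, lg = lp.strip(), lg.strip()
                if not lp:
                    continue                # sentence separator
                cp, cg = lp.split("\t"), lg.split("\t")
                total += 1
                if cp[6] == cg[6]:          # column 7: head index
                    uas += 1
                    if cp[7] == cg[7]:      # column 8: dependency relation
                        las += 1
        return 100.0 * uas / total, 100.0 * las / total

    print(score("output_pred.conll", "output_gold.conll"))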

Citation

If you find this repo useful, please cite our paper.

@inproceedings{ahmad-etal-2019-difficulties,
    title = "On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing",
    author = "Ahmad, Wasi and Zhang, Zhisong and Ma, Xuezhe and Hovy, Eduard and Chang, Kai-Wei and Peng, Nanyun",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    year = "2019",
    pages = "2440--2452"
}
